This report performs differential gene expression analysis and basic over-representation (thresholded gene set enrichment) analysis on expression data retrieved from GSE129943. The publication that produced this dataset can be accessed on the Cell website (Dvela-Levitt et al., 2019).This report builds on top of “Assignment 1: Dataset Selection and Initial Processing,” which can be downloaded and read here. The previous report cleaned the bulk RNASeq data from GSE129943, removing any low-count and duplicated genes. Additionally, the previous report performed TMM normalization on the read counts, visualized the normalized counts using MDS plots, mean-variance plots, and distribution curves. The final dataset produced by the report had 14,241 genes and six samples, three of which are replicates of normal human kidney cells and three of which are replicates of Muc1 Kidney Disease (MKD) cells. This report will be comparing the gene expression between normal kidney cells and MKD cells to identify enriched pathways, so as to better understand the mechanisms of MKD.
This R Notebook was developed by Evgeniya Gorobets as part of an assessment for 2022 BCB420H: Computational Systems Biology, University of Toronto, Toronto, CA. Specifically, this notebook was the final submitted product for Assignment 2 of the course. The journal entries accompanying this R Notebook are listed below.
This report was created in R (R Core Team, 2021) with the knitr Xie (2017) and kableExtra packages (Zhu, 2021).
edgeR (Chen et al., 2016; McCarthy et al., 2012; Robinson et al., 2010) was
used to estimate dispersion and perform differential gene expression analysis.
ggplot2 (Wickham, 2016) was used to create the volcano plots in this report.
ComplexHeatmap (Gu et al., 2016) and circlize (Gu et al., 2014) were used to
create the heatmaps in this report. The gprofiler2 package (Kolberg et al., 2020),
which is built on top of the g:Profiler API (Raudvere et al., 2019), was used to
perform over-representation analysis. ORA plots were generated with
gprofiler2 (Kolberg et al., 2020) and plotly (Sievert, 2020). The report was created inside
a Docker image (Merkel, 2014) whose base is Rocker (Boettiger & Eddelbuettel, 2017).
# Install required packages
if (!requireNamespace("kableExtra", quietly = TRUE)) {
install.packages("kableExtra")
}
if (!requireNamespace("edgeR", quietly = TRUE)) {
install.packages("edgeR")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
if(!requireNamespace("gprofiler2", quietly = TRUE)) {
install.packages("gprofiler2")
}
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
install.packages("ComplexHeatmap")
}
# circlize::colorRamp2 required for continuous color scaling in ComplexHeatmap
if (!requireNamespace("circlize", quietly = TRUE)) {
install.packages("circlize")
}
if (!requireNamespace("plotly", quietly = TRUE)) {
install.packages("plotly")
}
dataDir <- '../data'
# Source A1 to get normalized counts
a1 <- knitr:: knit_child('../Assignment1/Assignment1.Rmd', quiet=T)
expressionSet <- normalizedCounts
The MDS plot from Assignment 1 shows that all the MKD samples were highly clustered in both dimensions and all the control (normal kidney) samples were clustered in one dimension. Additionally, the publication that produced this data set (Dvela-Levitt et al., 2019) stated that there was no biological relation between the control and test samples. Therefore, the only factor in the design matrix is whether each sample is from a normal kidney or an MKD patient.
# Convert the normalized counts into a matrix
counts <- expressionSet[3:8]
rownames(counts) <- expressionSet$Ensembl
# Create a data.frame containing the type (normal/MKD) of each sample
sampleTypes <- unlist(lapply(colnames(counts), function(sampleName) {
strsplit(sampleName, "_")[[1]][1]
}))
samples <- data.frame(sample = colnames(counts), type = sampleTypes)
# Create a design matrix
designMatrix <- model.matrix(~ samples$type)
kableExtra::kable_styling(knitr::kable(
designMatrix, type="html",
col.names = c("Intercept", "Samples.TypeP1A8"),
caption=paste("The design matrix for the GSE129943 dataset.",
"The Samples.TypeP1A8 column represents which of",
"the samples is derived from an MKD patient.")),
"condensed")
| Intercept | Samples.TypeP1A8 |
|---|---|
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
The design matrix shown in Table 1 clearly demonstrates that three of the samples are of type P1A8 (i.e., MKD cells).
Now that the design matrix is defined, I can proceed with the differential
gene expression analysis. I will use the negative binomial generalized
linear model and the quasi-likelihood test, both implemented in the edgeR
package (Chen et al., 2016; McCarthy et al., 2012; Robinson et al., 2010).
I chose to use the edgeR package because it was designed specifically
for bulk RNASeq data.
# Set up a DGEList using our counts matrix and design matrix
counts <- as.matrix(counts)
dge <- edgeR::DGEList(counts=counts, group=samples$type)
dge <- edgeR::estimateDisp(dge, designMatrix)
# Fit the model using glmGLFit
fit <- edgeR::glmQLFit(dge, designMatrix)
# Compute differential expression and extract results into data.frame
qlfResults <- edgeR::glmQLFTest(fit, coef='samples$typeP1A8')
The next step is to identify the genes that were significantly differentially expressed. We are going to explore the number of DEGs under four different thresholds (0.05, 0.01, 0.005, 0.001) so that we may choose the p-value that is most appropriate for our data.
pvals <- c(0.05, 0.01, 0.005, 0.001)
numSignificant <- unlist(lapply(pvals, function(pval) {
length(which(qlfResults$table$PValue < pval))
}))
kableExtra::kable_styling(knitr::kable(
data.frame(`P-Value`=pvals, `Num. Significant Genes`=numSignificant,
check.names=F),
caption=paste("The number of significantly differentially",
"expressed genes in the GSE129943 dataset, based",
"on four different p-value thresholds."),
type="html"), "condensed")
| P-Value | Num. Significant Genes |
|---|---|
| 0.050 | 5597 |
| 0.010 | 4237 |
| 0.005 | 3851 |
| 0.001 | 3072 |
Table 2 shows that there are a large amount of significant DEGs, even when using a p-value as small as 0.001. This suggests a stronger threshold should be used to generate the thresholded gene lists for gene set enrichment analysis. However, before deciding a threshold, we must first adjust our p-values to correct for Multiple Hypothesis Testing.
Because we are testing for differential expression in thousands of genes (14,241 genes, to be precise), and each gene has its own null hypothesis that is being tested ( $H_0^{(X)} = $ gene \(X\) is not differentially expressed), we must consider the effects of Multiple Hypothesis Testing. Because we are testing so many hypotheses simultaneously, there is a much greater chance of false positives (Type I errors). In other words, there is a higher probability that a gene that is not differentially expressed will still have a p-value less than the significance threshold.
To correct for this, I will use the Benjamini-Hochberg procedure to compute a BH-adjusted p-value, also known as the False Discovery Rate (FDR). I am the Benjamini-Hochberg adjustment method because it is the most commonly used in recent literature and because it is less stringent than other correction methods (e.g., Bonferroni).
As above, we’re going to find the number of genes with significant differential expression under four different thresholds: 0.05, 0.01, 0.005, 0.001.
qlfAdjusted <- edgeR::topTags(qlfResults, n=nrow(counts), adjust.method="BH",
sort.by="PValue")
numSigAdj <- unlist(lapply(pvals, function(pval) {
length(which(qlfAdjusted$table$FDR < pval))
}))
kableExtra::kable_styling(knitr::kable(
data.frame(`Adjusted P-Value (FDR)`=pvals, `Num. Significant Genes`=numSigAdj,
check.names=F),
caption=paste("The number of significantly differentially",
"expressed genes in the GSE129943 dataset after",
"Benjaminig-Hochberg correction, based on four",
"different FDR thresholds."),
type="html"), "condensed")
| Adjusted P-Value (FDR) | Num. Significant Genes |
|---|---|
| 0.050 | 4602 |
| 0.010 | 3457 |
| 0.005 | 3103 |
| 0.001 | 2486 |
Even with adjustment for Multiple Hypothesis Testing, more than 2000 genes had a significant change in expression, even with the strictest threshold of 0.001 (Table 3).
Let us consider two thresholds (0.05 and 0.001) and the DEG list that would be
produced by each threshold. The DEGs produced by the stricter threshold (0.001)
would have stronger signals, so the gene set enrichment analysis would reveal
only the most strongly up-regulated or down-regulated pathways. This may help
identify the most important biological processes in MKD cells, but may miss out
on weakly-regulated pathways that still offer key insights to the mechanisms of
MKD cells. On the other hand, gene set enrichment analysis on the DEG list
produced by the weaker threshold (0.05) will likely reveal more pathways, but
may fail to highlight the most important ones. I believe that it is important
to conduct gene set enrichment analysis on both DEG lists, to get both a “deep”
view and a “broad” view of the enriched pathways. Therefore, I will subset the
qlfResults based on an adjusted p-value of 0.05 and 0.001. Since enough genes
pass correction, moving forward I will only consider genes with a significant
FDR (i.e., I will not subset or analyze based on the un-adjusted p-values).
# Subset QLF results based on both p-values
sigResults <- qlfAdjusted$table[qlfAdjusted$table$FDR < 0.05, ]
verySigResults <- qlfAdjusted$table[qlfAdjusted$table$FDR < 0.001, ]
# Get the up and down-regulated genes
upRegGenes <- rownames(sigResults[sigResults$logFC > 0, ])
veryUpRegGenes <- rownames(verySigResults[verySigResults$logFC > 0, ])
downRegGenes <- rownames(sigResults[sigResults$logFC < 0, ])
veryDownRegGenes <- rownames(verySigResults[verySigResults$logFC < 0, ])
# Save all the gene lists
geneLists <- list("upreg_genelist_01" = upRegGenes,
"downreg_genelist_01" = downRegGenes,
"upreg_genelist_001" = veryUpRegGenes,
"downreg_genelist_001" = veryDownRegGenes)
for (geneList in names(geneLists)) {
write.table(geneLists[[geneList]],
file=paste0(dataDir, "/", geneList, ".txt"),
quote=F, sep="\n", row.names=F, col.names=F)
}
First, we will use a volcano plot to visualize our DEGs relative to our full gene list.
# Copy relevant data from QLF Test results
fdrs <- qlfAdjusted$table[, c("logFC", "FDR")]
# Add a column that indicates whether each gene is not a DEG, a DEG but only
# with threshold=0.05 (significant), or a DEG with threshold=0.001 (very sig.)
# and whether it is up/down regulated
# NOTE: use strings instead of numbers so ggplot2 knows it is discrete
fdrs$significance <- "0"
fdrs$significance[fdrs$FDR < 0.05 & fdrs$logFC > 0] <- "11"
fdrs$significance[fdrs$FDR < 0.05 & fdrs$logFC < 0] <- "12"
fdrs$significance[fdrs$FDR < 0.001 & fdrs$logFC > 0] <- "21"
fdrs$significance[fdrs$FDR < 0.001 & fdrs$logFC < 0] <- "22"
# Add a column with log10(FDR)
fdrs$logFDR <- -log10(fdrs$FDR)
# Build volcano plot
plot <- ggplot2::ggplot(fdrs, ggplot2::aes(x=logFC, y=logFDR, col=significance))
# Add scatter points and color appropriately
plot <- plot + ggplot2::geom_point() +
ggplot2::scale_color_manual(
name = "Significance and\nExpression Change",
labels=c("FDR >= 0.05",
"0.001 <= FDR < 0.05; upregulated",
"0.001 <= FDR < 0.05; downregulated",
"FDR < 0.001; upregulated",
"FDR < 0.001; downregulated"),
values=c("gray", "pink", "turquoise2", "red", "blue"))
# Define styling and significance lines and add them to the plot
plotStyle <- ggplot2::theme(
plot.title = ggplot2::element_text(hjust = 0.5),
panel.background = ggplot2::element_rect(fill="white", color="black"),
panel.grid.major = ggplot2::element_line(colour="#eeeeee"),
panel.grid.minor = ggplot2::element_line(colour="#eeeeee"))
fdr05Line <- ggplot2::geom_hline(yintercept=-log10(0.05), linetype='dotted',
col = 'black', size=1)
fdr001Line <- ggplot2::geom_hline(yintercept=-log10(0.001), linetype='dotted',
col = 'black', size=1)
plot <- plot + plotStyle + fdr05Line + fdr001Line
# Add title, axis labels
xLabel <- "Log2 FC"
yLabel <- "-Log10 FDR"
title <- "DEGs in GSE129943 (-logFDR vs. logFC)"
plot <- plot +
ggplot2::ggtitle(title) + ggplot2::ylab(yLabel) + ggplot2::xlab(xLabel)
plot
Figure 1: A volcano plot showing the expression change (log2FC) and significance of the expression change after BH correction (-log10FDR) for all genes in the cleaned GSE129943 dataset.
The volcano plot in Figure 1 is fairly symmetrical, indicating that there are roughly an equal number of up-regulated and down-regulated genes in MKD cells (relative to normal kidney cells). This plot also visualizes what was already demonstrated numerically: that the number of very significantly (FDR < 0.001) differentially expressed genes is quite large (2486 genes out of 14,241, i.e., an astonishing 17.46%). Moreover, more than half of significantly regulated genes (FDR < 0.05) are very significantly regulated (2486 genes out of 4602 = 54%); this is shown by the proportion of red dots to pink dots on the volcano plot.
There are three genes highlighted in the study that produced this data: MUC1, ATF6, and TMED9 (Dvela-Levitt et al., 2019). MUC1 is the gene responsible for MKD (MKD = Mucin 1 Kidney Disease) (Dvela-Levitt et al., 2019). A frame shift in MUC1 (producing the mutant MUC1-fs allele) causes a misfolded version of the Mucin 1 protein to accumulate in cells (Dvela-Levitt et al., 2019). ATF6 is the gene that encodes Activating Transcription Factor 6. This TF activates one of the three branches of unfolded protein response (UPR), which is a cellular response that helps clear misfolded proteins from the cell and maintain ER homeostasis (Dvela-Levitt et al., 2019). In the original study, the authors compared this RNASeq data with a published transcriptional signature of ATF6-specific UPR activation (Adamson et al., 2016) and concluded that the ATF6 branch is significantly up-regulated (Dvela-Levitt et al., 2019). The final gene of interest is TMED9, which encodes a cargo receptor in the p24 family that is found on vesicles in the early secretory pathway (Dvela-Levitt et al., 2019). According to the Dvela-Levitt et al., TMED9-containing vesicles trap misfolded MUC1 proteins, preventing them from lysosomic degradation; thus, TMED9 plays a key role in the development of MKD (Dvela-Levitt et al., 2019). In the study, Dvela-Levitt et al. performed immuno-fluorescence (IF) microscopy on normal kidney organoids and MKD patient-derived organoids and found that TMED9 was significantly more abundant in patient-derived organoids (Dvela-Levitt et al., 2019).
Because of the critical roles of MUC1, ATF6, and TMED9 in MKD, it is important to visualize these genes in our normalized expression data.
# Get Ensembl gene IDs for each gene of interest
muc1 <- expressionSet$Ensembl[which(expressionSet$HGNC == "MUC1")]
atf6 <- expressionSet$Ensembl[which(expressionSet$HGNC == "ATF6")]
tmed9 <- expressionSet$Ensembl[which(expressionSet$HGNC == "TMED9")]
# Label the genes of interest so that they have their own colour
fdrs$`Genes of Interest` <- "0"
fdrs[muc1, "Genes of Interest"] <- "1"
fdrs[atf6, "Genes of Interest"] <- "2"
fdrs[tmed9, "Genes of Interest"] <- "3"
# Sort so that the colored points are plotted last
fdrs <- fdrs[order(fdrs$`Genes of Interest`, decreasing = F), ]
# Build new volcano plot
plot <- ggplot2::ggplot(fdrs, ggplot2::aes(x=logFC, y=logFDR,
col=`Genes of Interest`,
alpha=`Genes of Interest`))
# Add scatter points and color appropriately
plot <- plot + ggplot2::geom_point() +
ggplot2::scale_color_manual(labels=c("Other genes", "MUC1", "ATF6", "TMED9"),
values=c("gray", "red", "blue", "green")) +
ggplot2::scale_alpha_manual(labels=c("Other genes", "MUC1", "ATF6", "TMED9"),
values=c(0.1, 1.0, 1.0, 1.0))
# Add threshold lines and styling
plot <- plot + plotStyle + fdr05Line + fdr001Line
# Add title, axis labels
xLabel <- "Log2 FC"
yLabel <- "-Log10 FDR"
title <- "DEGs in GSE129943 (-logFDR vs. logFC)"
plot <- plot +
ggplot2::ggtitle(title) + ggplot2::ylab(yLabel) + ggplot2::xlab(xLabel)
plot
Figure 2: A volcano plot showing the expression change (log2FC) and significance of the expression change after BH correction (-log10FDR) of MUC1, ATF6, and TMED9 relative to all other genes in the cleaned GSE129943 dataset.
The volcano plot in Figure 2 reveals that each gene of interest had a minimal, insignificant change in expression. To get more insight, we will also look at the unadjusted p-values for each of these genes.
kableExtra::kable_styling(
knitr::kable(
qlfAdjusted$table[c(muc1, atf6, tmed9), c("logFC", "PValue", "FDR")],
caption="Log fold change, p-value, and FDR of MUC1, ATF6, and TMED9.",
type="html"),
"condensed")
| logFC | PValue | FDR | |
|---|---|---|---|
| ENSG00000185499 | 0.2084175 | 0.4502066 | 0.6360508 |
| ENSG00000118217 | 0.1932326 | 0.1067418 | 0.2290702 |
| ENSG00000184840 | 0.1220374 | 0.1363541 | 0.2755526 |
The unadjusted p-values for each of these genes was greater than 0.1, and therefore don’t pass even the weakest threshold of significance (Table 4). It is unclear whether the RNASeq technology used would differentiate between the MUC1-fs (frame shift) and MUC1-wt (wild type) transcripts. Assuming both types of transcripts get classified as MUC1, then we would not expect a significant difference in MUC1 expression between MKD cells and normal kidney cells (since the MUC1-fs mutation does not affect the transcription of the gene). Therefore, the very high p-value and relatively low log fold change for MUC1 is not surprising.
On the other hand, the low log fold changes for ATF6 and TMED9 are unexpected. Although differential expression analysis alone is not enough to confirm whether the entire ATF6 branch of UPR was up-regulated in MKD cells, one would expect that at least the ATF6 gene would be significantly up-regulated. However, this is not the case, so further investigation via gene set enrichment analysis is required.
For TMED9, the authors of the study made no claim about transcriptional up-regulation; they only observed increased protein abundance in MKD patient-derived organoids (Dvela-Levitt et al., 2019). However, one would expect the TMED9 gene to be significantly up-regulated in MKD cells in order to produce the observed TMED9 protein abundance. Again, further insight is required to determine why the results of this differential gene expression analysis seem to contradict the results of the study.
Next, we will examine our expression set with a series of heatmaps.
plotHeatmap <- function(counts, title) {
# Row normalization (scale each gene)
# NOTE: need to use t() because scale operates on columns
scaledCounts <- t(scale(t(counts)))
# Define colours
breaks <- c(min(scaledCounts), 0, max(scaledCounts))
colors <- circlize::colorRamp2(breaks=breaks,
colors=c("red", "white", "blue"))
# NOTE: use column_title as the main title
heatmap <- ComplexHeatmap::Heatmap(
scaledCounts, col=colors, show_column_names=T, show_row_names=F,
show_row_dend=T, show_column_dend=T, column_names_rot=45,
row_title="Genes", column_title=title,
row_title_side="left", column_title_side = "top",
heatmap_legend_param=list(title = "Gene-Scaled Counts"))
return(heatmap)
}
plotHeatmap(counts, paste0("Heatmap of Normalized and Gene-Scaled\n",
"RNASeq Counts in GSE129943 (all genes)"))
Figure 3: A heatmap showing the normalized and gene-scaled expression change of all genes in the cleaned GSE129943 dataset. Each row represents a gene, each column represents a biological sample, and both genes and samples are clustered together based on the expression data.
plotHeatmap(counts[rownames(sigResults), ],
paste0("Heatmap of Normalized and Gene-Scaled\n",
"RNASeq Counts in GSE129943 (FDR < 0.05)"))
plotHeatmap(counts[rownames(verySigResults), ],
paste0("Heatmap of Normalized and Gene-Scaled\n",
"RNASeq Counts in GSE129943 (FDR < 0.001)"))
Figure 4: Two heatmap showing the normalized and gene-scaled expression change of significant DEGs in the cleaned GSE129943 dataset. The left heatmap includes all DEGs with FDR < 0.05. The right heatmap includes all DEGs with FDR < 0.001. Each row represents a gene, each column represents a biological sample, and both genes and samples are clustered together based on the expression data.
The first heatmap (Figure 3), has a visible signal differentiating MKD cells and normal kidney cells, despite inlcuding non-diffentially expressed genes. In the column dendogram, we observe that the MKD samples cluster together and the normal kidney samples cluster together, reflecting the strong separation we saw earlier in the MDS plot. When only DEGs are mapped (FDR < 0.05; Figure 4, left), the signal in the heatmap becomes a lot stronger. It looks as though each gene is either up-regulated in all the normal kidney samples and down-regulated in all the MKD samples, or vice-versa. The last heatmap (Figure 4, right), which includes only genes with FDR < 0.001, is very similar to the second. It shows the same strong signal, but has fewer white-colored or pale-colored cells. In other words, the final heatmap only shows genes that are strongly up- or down-regulated.
I will use Fisher’s Exact Test for my over-representation analysis. There are two main reasons why we are using Fisher’s Exact Test instead of a binomial or chi-squared test:
As stated above, there are multiple platforms that implement Fisher’s Exact Test in the context of thresholded gene set enrichment. I will use g:Profiler as my ORA tool because (a) it is frequently updated, with updates occurring approximately every three months, following quarterly releases of Ensembl databases; (b) it contains human gene set annotation from all the major annotation datasets (GO, Reactome, KEGG, WikiPathways), (c) it accepts ENSEMBL gene IDs in its gene list, which is the primary type of identifier I am using; (d) it has an accompanying R package, so I can query their database programatically; and (e) it is the platform I am most familiar with (Kolberg et al., 2020; Raudvere et al., 2019).
To get the most insight into our data and retrieve all possible significant pathways, I will use all the pathway annotation datasets available in g:Profiler. This includes GO Biological Processes “The Gene Ontology Resource” (2021), KEGG pathways Kanehisa et al. (2021), Reactome data (Gillespie et al., 2022), and WikiPathways (Martens et al., 2021). I will not include electronic GO annotations because my thresholded gene lists are sufficiently long and should produce results even without electronic annotations.
The version of these four annotation datasets are listed in Table 5.
annotationData <- c("GO:BP", "KEGG", "REAC", "WP")
# Get version info for all annotation datasets
versions <- gprofiler2::get_version_info(organism = "hsapiens")
# Format into a readable table
versionsDf <- as.data.frame(do.call(rbind, versions$sources[annotationData]))
versionsDf$version <- gsub("\\n", "; ", versionsDf$version)
colnames(versionsDf) <- c("Annotation Dataset Name", "Version")
kableExtra::kable_styling(
knitr::kable(versionsDf,
caption=paste("Version information for GO, KEGG, Reactome, and",
"WikiPathways pathway annotations"), type="html"),
"condensed"
)
| Annotation Dataset Name | Version | |
|---|---|---|
| GO:BP | biological process | annotations: BioMart; classes: releases/2021-12-15 |
| KEGG | Kyoto Encyclopedia of Genes and Genomes | KEGG FTP Release 2021-12-27 |
| REAC | Reactome | annotations: BioMart; classes: 2022-1-3 |
| WP | WikiPathways | 20211210 |
GO and Reactome annotations rely on BioMart releases, so I will also note the BioMart version used in my over-rerpresentation analysis.
cat(paste("BioMart Version:", versions$biomart, versions$biomart_version))
## BioMart Version: Ensembl 105
First, I define two helper functions to help with my gene set enrichment analysis.
getGeneSets <- function(queryGenes, annotationSources, backgroundGenes=NULL) {
# Determine whether the background for Fisher's test will be custom
if (is.null(backgroundGenes)) {
domainScope <- "annotated"
} else {
domainScope <- "custom"
}
return(
gprofiler2::gost(query = queryGenes,organism = "hsapiens",
ordered_query = F, multi_query = F, significant = F,
exclude_iea = T, measure_underrepresentation = F,
evcodes = F, user_threshold = 0.05,
correction_method = "fdr", domain_scope = domainScope,
custom_bg = backgroundGenes, sources = annotationSources)
)
}
getTopGeneSets <- function(oraDf, annotationSource, numSets=10) {
results <- oraDf[oraDf$source == annotationSource, ]
topResults <- results[1:numSets, c("source", "term_id", "term_name",
"term_size", "p_value") ]
colnames(topResults) <- c("Source", "Term ID", "Term Name", "Term Size",
"P-Value")
return(topResults)
}
I am choosing a significance threshold of 0.05 because it is the most common threshold to use and because I want to fetch as many gene sets as possible. I am again using the Benjamini-Hochberg procedure to correct for multiple hypothesis testing, for the same reasons described in Multiple Hypothesis Testing.
I will start by performing ORA on significantly regulated genes (DEG threshold: FDR < 0.05).
bgGenes <- rownames(counts)
oraFile <- paste0(dataDir, "/upreg_ora_05.rds")
if (!file.exists(oraFile)) {
# Fetch ORA results from g:Profiler if they aren't saved and save them
upRegORA <- getGeneSets(upRegGenes, annotationData, bgGenes)
saveRDS(upRegORA, file=oraFile)
} else {
# Load ORA results (streamlines so I don't have to query g:Profiler each time)
upRegORA <- readRDS(oraFile)
}
# Save all ORA results in list
allOras <- list()
allOras[[1]] <- list(result = upRegORA$result,
sig = "FDR < 0.05", expr = "Up-regulated")
oraPlot <- gprofiler2::gostplot(upRegORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
'in Up-Regulated DEG List',
'(Threshold: FDR < 0.05)'),
xaxis = list(title = 'Annotation Dataset'))
Figure 5: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on significantly (FDR < 0.05) up-regulated genes in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).
The interactive plot in Figure 5 helps us easily identify the most significantly over-represented gene sets in our up-regulated gene list. However, if we look at the top give GO:BP gene sets from our results (Table 6), we notice that they have extremely vague names and each contains more than 3000 genes.
kableExtra::kable_styling(
knitr::kable(getTopGeneSets(upRegORA$result, "GO:BP", numSets=5),
caption=paste("Top five over-represented GO:BP gene sets in ORA",
"of up-regulated DEGs with FDR < 0.05"),
digits=42, type="html"),
"condensed")
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| GO:BP | GO:0050896 | response to stimulus | 4398 | 1.100000e-41 |
| GO:BP | GO:0008150 | biological_process | 10828 | 1.330195e-32 |
| GO:BP | GO:0048518 | positive regulation of biological process | 3649 | 5.069471e-32 |
| GO:BP | GO:0051716 | cellular response to stimulus | 3865 | 4.612058e-30 |
| GO:BP | GO:0032501 | multicellular organismal process | 3205 | 1.015804e-29 |
These “parent” gene sets are not informative, and therefore we will narrow down our analysis to smaller gene sets (containing no more than 200 genes).
showTopSmallGeneSets <- function(ora, annotationSources, numSets=10) {
smallGeneSets <- ora$result[ora$result$term_size <= 200, ]
for (dataset in annotationSources) {
geneSetKable <- knitr::kable(
getTopGeneSets(smallGeneSets, dataset, numSets=numSets),
caption=paste("Top", numSets, "over-represented small gene sets from",
dataset, "in ORA of", ora$expr, "DEGs with", ora$sig, ".",
"'Small' gene sets are defined as any gene sets with at",
"most 200 genes."),
digits=42, valign='t', row.names=FALSE, format="html")
print(kableExtra::kable_styling(geneSetKable, "condensed"))
cat('\n <br/> \n')
}
}
showTopSmallGeneSets(allOras[[1]], annotationData)
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| GO:BP | GO:0140546 | defense response to symbiont | 155 | 1.923044e-14 |
| GO:BP | GO:0051607 | defense response to virus | 155 | 1.923044e-14 |
| GO:BP | GO:1903900 | regulation of viral life cycle | 102 | 6.110663e-14 |
| GO:BP | GO:0050792 | regulation of viral process | 119 | 1.688086e-13 |
| GO:BP | GO:0019058 | viral life cycle | 196 | 1.103854e-12 |
| GO:BP | GO:0048525 | negative regulation of viral process | 69 | 2.111753e-11 |
| GO:BP | GO:0045071 | negative regulation of viral genome replication | 45 | 1.151882e-10 |
| GO:BP | GO:0019079 | viral genome replication | 103 | 7.242580e-10 |
| GO:BP | GO:0045069 | regulation of viral genome replication | 68 | 8.470246e-09 |
| GO:BP | GO:0002831 | regulation of response to biotic stimulus | 191 | 2.789158e-08 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| KEGG | KEGG:05169 | Epstein-Barr virus infection | 159 | 2.578725e-10 |
| KEGG | KEGG:03050 | Proteasome | 42 | 5.764911e-09 |
| KEGG | KEGG:05160 | Hepatitis C | 120 | 4.486228e-07 |
| KEGG | KEGG:05164 | Influenza A | 122 | 6.865981e-07 |
| KEGG | KEGG:04612 | Antigen processing and presentation | 43 | 1.896833e-06 |
| KEGG | KEGG:04210 | Apoptosis | 115 | 3.147859e-05 |
| KEGG | KEGG:05202 | Transcriptional misregulation in cancer | 126 | 3.147859e-05 |
| KEGG | KEGG:04360 | Axon guidance | 150 | 3.309012e-05 |
| KEGG | KEGG:04060 | Cytokine-cytokine receptor interaction | 113 | 1.027028e-04 |
| KEGG | KEGG:04217 | Necroptosis | 113 | 1.027028e-04 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| REAC | REAC:R-HSA-913531 | Interferon Signaling | 154 | 1.913644e-15 |
| REAC | REAC:R-HSA-909733 | Interferon alpha/beta signaling | 53 | 7.018134e-14 |
| REAC | REAC:R-HSA-1236975 | Antigen processing-Cross presentation | 84 | 3.766138e-13 |
| REAC | REAC:R-HSA-1236974 | ER-Phagosome pathway | 76 | 2.323265e-12 |
| REAC | REAC:R-HSA-5357801 | Programmed Cell Death | 188 | 3.259930e-12 |
| REAC | REAC:R-HSA-8852276 | The role of GTSE1 in G2/M progression after G2 checkpoint | 67 | 6.759750e-12 |
| REAC | REAC:R-HSA-109581 | Apoptosis | 161 | 3.895050e-11 |
| REAC | REAC:R-HSA-2467813 | Separation of Sister Chromatids | 174 | 2.417929e-10 |
| REAC | REAC:R-HSA-174178 | APC/C:Cdh1 mediated degradation of Cdc20 and other APC/C:Cdh1 targeted proteins in late mitosis/early G1 | 71 | 3.350562e-10 |
| REAC | REAC:R-HSA-176814 | Activation of APC/C and APC/C:Cdc20 mediated degradation of mitotic proteins | 74 | 3.371945e-10 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| WP | WP:WP183 | Proteasome degradation | 58 | 8.444368e-15 |
| WP | WP:WP5115 | Network map of SARS-CoV-2 signaling pathway | 136 | 1.988345e-12 |
| WP | WP:WP5039 | SARS-CoV-2 innate immunity evasion and cell-specific immune response | 50 | 3.512453e-07 |
| WP | WP:WP619 | Type II interferon signaling (IFNG) | 25 | 7.417131e-07 |
| WP | WP:WP254 | Apoptosis | 76 | 2.679573e-05 |
| WP | WP:WP2446 | Retinoblastoma gene in cancer | 87 | 2.679573e-05 |
| WP | WP:WP4197 | Immune response to tuberculosis | 22 | 2.679573e-05 |
| WP | WP:WP5083 | Neuroinflammation and glutamatergic signaling | 99 | 3.693096e-05 |
| WP | WP:WP2840 | Hair follicle development: cytodifferentiation - part 3 of 3 | 46 | 4.883351e-05 |
| WP | WP:WP4304 | Oligodendrocyte specification and differentiation, leading to myelin components for CNS | 15 | 2.008485e-04 |
Table 7 shows the ten most significantly over-represented gene sets from each annotation source, based on the up-regulated gene list with threshold 0.05. For each annotation dataset, the top gene sets were related to viral response, interferon signaling, cell cycle, and apoptosis. While these were not mentioned in the study, it is believable that kidney cells might react similarly to a toxic proteinopathy as they do to viral infection (viral response generally involves changes in the cell cycle, apoptosis, and interferon signaling).
oraFile <- paste0(dataDir, "/downreg_ora_05.rds")
if (!file.exists(oraFile)) {
downRegORA <- getGeneSets(downRegGenes, annotationData, bgGenes)
saveRDS(downRegORA, file=oraFile)
} else {
downRegORA <- readRDS(oraFile)
}
allOras[[2]] <- list(result = downRegORA$result,
sig = "FDR < 0.05", expr = "Down-regulated")
oraPlot <- gprofiler2::gostplot(downRegORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
'in Down-Regulated DEG List',
'(Threshold: FDR < 0.05)'),
xaxis = list(title = 'Annotation Dataset'),
yaxis = list(title = '-log10(FDR) (capped at 10^-16)'))
Figure 6: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on significantly (FDR < 0.05) down-regulated genes in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).
showTopSmallGeneSets(allOras[[2]], annotationData)
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| GO:BP | GO:0002181 | cytoplasmic translation | 130 | 1.259901e-24 |
| GO:BP | GO:0036293 | response to decreased oxygen levels | 138 | 2.403078e-09 |
| GO:BP | GO:0001666 | response to hypoxia | 135 | 3.103712e-09 |
| GO:BP | GO:0070482 | response to oxygen levels | 151 | 8.312130e-09 |
| GO:BP | GO:0036294 | cellular response to decreased oxygen levels | 88 | 1.039810e-06 |
| GO:BP | GO:0007610 | behavior | 130 | 1.291760e-06 |
| GO:BP | GO:0071456 | cellular response to hypoxia | 85 | 1.346778e-06 |
| GO:BP | GO:0071453 | cellular response to oxygen levels | 100 | 1.077212e-05 |
| GO:BP | GO:0045229 | external encapsulating structure organization | 140 | 1.816234e-04 |
| GO:BP | GO:0051960 | regulation of nervous system development | 186 | 2.779549e-04 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| KEGG | KEGG:03010 | Ribosome | 127 | 7.783092e-21 |
| KEGG | KEGG:05171 | Coronavirus disease - COVID-19 | 170 | 3.064324e-17 |
| KEGG | KEGG:04066 | HIF-1 signaling pathway | 83 | 5.906660e-04 |
| KEGG | KEGG:04974 | Protein digestion and absorption | 46 | 5.906660e-04 |
| KEGG | KEGG:04140 | Autophagy - animal | 132 | 9.272957e-04 |
| KEGG | KEGG:00010 | Glycolysis / Gluconeogenesis | 43 | 2.136640e-03 |
| KEGG | KEGG:05230 | Central carbon metabolism in cancer | 55 | 2.136640e-03 |
| KEGG | KEGG:04150 | mTOR signaling pathway | 132 | 7.217849e-03 |
| KEGG | KEGG:01230 | Biosynthesis of amino acids | 56 | 7.217849e-03 |
| KEGG | KEGG:05211 | Renal cell carcinoma | 66 | 8.472398e-03 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| REAC | REAC:R-HSA-156842 | Eukaryotic Translation Elongation | 89 | 8.519000e-39 |
| REAC | REAC:R-HSA-156902 | Peptide chain elongation | 85 | 8.519000e-39 |
| REAC | REAC:R-HSA-192823 | Viral mRNA Translation | 85 | 1.307770e-37 |
| REAC | REAC:R-HSA-72689 | Formation of a pool of free 40S subunits | 97 | 1.615670e-37 |
| REAC | REAC:R-HSA-156827 | L13a-mediated translational silencing of Ceruloplasmin expression | 107 | 1.621230e-37 |
| REAC | REAC:R-HSA-72706 | GTP hydrolysis and joining of the 60S ribosomal subunit | 108 | 5.528209e-36 |
| REAC | REAC:R-HSA-9633012 | Response of EIF2AK4 (GCN2) to amino acid deficiency | 97 | 2.166951e-35 |
| REAC | REAC:R-HSA-72613 | Eukaryotic Translation Initiation | 115 | 2.166951e-35 |
| REAC | REAC:R-HSA-72737 | Cap-dependent Translation Initiation | 115 | 2.166951e-35 |
| REAC | REAC:R-HSA-975956 | Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC) | 91 | 7.906436e-35 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| WP | WP:WP477 | Cytoplasmic ribosomal proteins | 85 | 5.919000e-39 |
| WP | WP:WP4018 | Clear cell renal cell carcinoma pathways | 72 | 2.081253e-10 |
| WP | WP:WP1946 | Cori cycle | 13 | 4.594145e-06 |
| WP | WP:WP4629 | Aerobic glycolysis | 11 | 4.594145e-06 |
| WP | WP:WP534 | Glycolysis and gluconeogenesis | 33 | 8.763435e-06 |
| WP | WP:WP3614 | Photodynamic therapy-induced HIF-1 survival signaling | 32 | 1.608166e-04 |
| WP | WP:WP4290 | Metabolic reprogramming in colon cancer | 40 | 4.243684e-03 |
| WP | WP:WP107 | Translation factors | 49 | 5.212685e-03 |
| WP | WP:WP5094 | Orexin receptor pathway | 96 | 7.943289e-03 |
| WP | WP:WP4549 | Fragile X syndrome | 97 | 8.900015e-03 |
The common terms in the gene sets listed in Table 8 are translation, elongation, ribosomes, autophagy, glucogenesis and synthesis of nucleotide sugars. Again, it is logical that a cell affected by the accumulation of a misfolded protein would down-regulate its protein synthesis (ribosomal assembly, peptide translation and elongation, amino acid biosynthesis), in order to prevent further aggregation. Additionally, the down-regulation of autophagy is consistent with the study’s observations that MKD cells seem unable to clear the misfolded Muc1 proteins through lysosomal degradation. The other top terms from the annotation datasets are less cohesive: among them are response to hypoxia, demethylation, HIG-1 signaling, and carcinoma pathways.
oraFile <- paste0(dataDir, "/alldegs_ora_05.rds")
if (!file.exists(oraFile)) {
allGenes <- c(upRegGenes, downRegGenes)
sigORA <- getGeneSets(allGenes, annotationData, bgGenes)
saveRDS(sigORA, file=oraFile)
} else {
sigORA <- readRDS(oraFile)
}
allOras[[3]] <- list(result = sigORA$result,
sig = "FDR < 0.05", expr = "All DEGs")
oraPlot <- gprofiler2::gostplot(sigORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
'in DEG List (Threshold: FDR < 0.05)'),
xaxis = list(title = 'Annotation Dataset'))
Figure 7: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on all significant (FDR < 0.05) DEGs in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).
showTopSmallGeneSets(allOras[[3]], annotationData)
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| GO:BP | GO:0002181 | cytoplasmic translation | 130 | 2.755235e-10 |
| GO:BP | GO:0045229 | external encapsulating structure organization | 140 | 1.711119e-09 |
| GO:BP | GO:0030198 | extracellular matrix organization | 138 | 5.489826e-09 |
| GO:BP | GO:0043062 | extracellular structure organization | 138 | 5.489826e-09 |
| GO:BP | GO:0042692 | muscle cell differentiation | 146 | 2.808452e-08 |
| GO:BP | GO:0140546 | defense response to symbiont | 155 | 2.808452e-08 |
| GO:BP | GO:0051607 | defense response to virus | 155 | 2.808452e-08 |
| GO:BP | GO:0007411 | axon guidance | 116 | 4.370059e-08 |
| GO:BP | GO:0097485 | neuron projection guidance | 116 | 4.370059e-08 |
| GO:BP | GO:0044057 | regulation of system process | 197 | 8.110596e-08 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| KEGG | KEGG:05171 | Coronavirus disease - COVID-19 | 170 | 9.549956e-17 |
| KEGG | KEGG:04512 | ECM-receptor interaction | 63 | 2.850300e-06 |
| KEGG | KEGG:05412 | Arrhythmogenic right ventricular cardiomyopathy | 51 | 2.850300e-06 |
| KEGG | KEGG:03010 | Ribosome | 127 | 3.749980e-06 |
| KEGG | KEGG:05410 | Hypertrophic cardiomyopathy | 57 | 8.562737e-06 |
| KEGG | KEGG:05206 | MicroRNAs in cancer | 142 | 1.022466e-05 |
| KEGG | KEGG:04514 | Cell adhesion molecules | 79 | 1.022466e-05 |
| KEGG | KEGG:05202 | Transcriptional misregulation in cancer | 126 | 1.022466e-05 |
| KEGG | KEGG:04974 | Protein digestion and absorption | 46 | 1.047406e-05 |
| KEGG | KEGG:05169 | Epstein-Barr virus infection | 159 | 1.676900e-05 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| REAC | REAC:R-HSA-156902 | Peptide chain elongation | 85 | 9.913603e-18 |
| REAC | REAC:R-HSA-156842 | Eukaryotic Translation Elongation | 89 | 1.648108e-17 |
| REAC | REAC:R-HSA-192823 | Viral mRNA Translation | 85 | 6.160864e-17 |
| REAC | REAC:R-HSA-1474244 | Extracellular matrix organization | 196 | 6.160864e-17 |
| REAC | REAC:R-HSA-9010553 | Regulation of expression of SLITs and ROBOs | 156 | 1.132525e-16 |
| REAC | REAC:R-HSA-376176 | Signaling by ROBO receptors | 194 | 6.919092e-16 |
| REAC | REAC:R-HSA-72689 | Formation of a pool of free 40S subunits | 97 | 1.264974e-15 |
| REAC | REAC:R-HSA-72764 | Eukaryotic Translation Termination | 89 | 3.968572e-15 |
| REAC | REAC:R-HSA-1799339 | SRP-dependent cotranslational protein targeting to membrane | 108 | 4.400929e-15 |
| REAC | REAC:R-HSA-975956 | Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC) | 91 | 4.400929e-15 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| WP | WP:WP477 | Cytoplasmic ribosomal proteins | 85 | 1.387354e-18 |
| WP | WP:WP5115 | Network map of SARS-CoV-2 signaling pathway | 136 | 5.212673e-09 |
| WP | WP:WP5055 | Burn wound healing | 73 | 3.682437e-08 |
| WP | WP:WP183 | Proteasome degradation | 58 | 5.418779e-08 |
| WP | WP:WP2328 | Allograft Rejection | 40 | 5.549131e-07 |
| WP | WP:WP4018 | Clear cell renal cell carcinoma pathways | 72 | 3.321048e-06 |
| WP | WP:WP2877 | Vitamin D receptor pathway | 107 | 1.018915e-05 |
| WP | WP:WP2118 | Arrhythmogenic right ventricular cardiomyopathy | 50 | 1.591225e-05 |
| WP | WP:WP2431 | Spinal cord injury | 79 | 3.381740e-05 |
| WP | WP:WP5083 | Neuroinflammation and glutamatergic signaling | 99 | 3.381740e-05 |
Even though there are more significantly up-regulated genes than there are down-regulated genes in our dataset (2560 vs. 2042), it seems that the down-regulated pathways (specifically, ribosomal functions, translation, elongation) were more significantly over-represented in all the annotation datasets than the up-regulated pathways (Table 9).
getNumSigPathways <- function(ora) {
return(length(which(ora$p_value < 0.05)))
}
compareNumGeneSets <- function(oras, fdrs, logFCs, caption) {
numPathways <- unlist(lapply(oras, getNumSigPathways))
exprChanges <- unlist(lapply(logFCs, function(foldChange) {
if (foldChange == "0") {
return("All DEGs")
} else if (foldChange == "+") {
return("Up-regulated")
} else if (foldChange == "-") {
return("Down-regulated")
}
}))
gsKable <- knitr::kable(
data.frame(`DEG Sig. Threshold` = fdrs, `Expression Change` = exprChanges,
`Num. Gene Sets` = numPathways, check.names = FALSE),
caption=caption, type="html")
kableExtra::kable_styling(gsKable, "condensed")
}
oras <- list(sigORA$result, upRegORA$result, downRegORA$result)
compareNumGeneSets(oras, fdrs=rep("FDR < 0.05", 3), logFCs=c("0", "+", "-"),
caption = paste("Number of significantly over-represented",
"gene sets in gene lists with threshold",
"FDR < 0.05."))
| DEG Sig. Threshold | Expression Change | Num. Gene Sets |
|---|---|---|
| FDR < 0.05 | All DEGs | 1773 |
| FDR < 0.05 | Up-regulated | 1488 |
| FDR < 0.05 | Down-regulated | 449 |
As expected, the cumulative ORA (all significantly regulated genes) returned more gene sets than either of the individual ORAs (with only up- or down- regulated genes) (Table 10). However, the cumulative ORA returned fewer gene sets than the sum of the gene sets from the individual ORAs. The ORA on only down-regulated genes returned far fewer gene sets than the ORA with only up-regulated genes (~30% as many gene sets), even though the down-regulated gene list was almost 80% the length of the up-regulated gene list.
Now that I have a big picture view of the enriched pathways in MKD cells, I will use the gene lists generated by the stricter threshold (FDR < 0.001) to identify the most important and significantly regulated pathways.
oraFile <- paste0(dataDir, "/upreg_ora_001.rds")
if (!file.exists(oraFile)) {
veryUpRegORA <- getGeneSets(veryUpRegGenes, annotationData, bgGenes)
saveRDS(veryUpRegORA, file=oraFile)
} else {
veryUpRegORA <- readRDS(oraFile)
}
allOras[[4]] <- list(result = veryUpRegORA$result,
sig = "FDR < 0.001", expr = "Up-regulated")
oraPlot <- gprofiler2::gostplot(veryUpRegORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
'in Up-Regulated DEG List',
'(Threshold: FDR < 0.001)'),
xaxis = list(title = 'Annotation Dataset'))
Figure 8: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on all very significantly (FDR < 0.001) up-regulated genes in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).
showTopSmallGeneSets(allOras[[4]], annotationData)
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| GO:BP | GO:0051607 | defense response to virus | 155 | 3.521539e-24 |
| GO:BP | GO:0140546 | defense response to symbiont | 155 | 3.521539e-24 |
| GO:BP | GO:0048525 | negative regulation of viral process | 69 | 1.359792e-18 |
| GO:BP | GO:1903900 | regulation of viral life cycle | 102 | 5.108345e-18 |
| GO:BP | GO:0050792 | regulation of viral process | 119 | 5.165447e-18 |
| GO:BP | GO:0045071 | negative regulation of viral genome replication | 45 | 1.085921e-16 |
| GO:BP | GO:0019058 | viral life cycle | 196 | 3.041602e-15 |
| GO:BP | GO:0002683 | negative regulation of immune system process | 181 | 6.477933e-14 |
| GO:BP | GO:0045069 | regulation of viral genome replication | 68 | 3.921574e-13 |
| GO:BP | GO:0019079 | viral genome replication | 103 | 2.720715e-12 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| KEGG | KEGG:05169 | Epstein-Barr virus infection | 159 | 1.189207e-10 |
| KEGG | KEGG:05164 | Influenza A | 122 | 3.553557e-10 |
| KEGG | KEGG:05160 | Hepatitis C | 120 | 8.128932e-07 |
| KEGG | KEGG:05162 | Measles | 104 | 2.176205e-06 |
| KEGG | KEGG:04612 | Antigen processing and presentation | 43 | 2.176205e-06 |
| KEGG | KEGG:03050 | Proteasome | 42 | 7.346302e-06 |
| KEGG | KEGG:05167 | Kaposi sarcoma-associated herpesvirus infection | 149 | 3.085640e-05 |
| KEGG | KEGG:05170 | Human immunodeficiency virus 1 infection | 167 | 1.531489e-04 |
| KEGG | KEGG:04514 | Cell adhesion molecules | 79 | 1.531489e-04 |
| KEGG | KEGG:05418 | Fluid shear stress and atherosclerosis | 111 | 2.023412e-04 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| REAC | REAC:R-HSA-913531 | Interferon Signaling | 154 | 4.041223e-24 |
| REAC | REAC:R-HSA-909733 | Interferon alpha/beta signaling | 53 | 1.205883e-21 |
| REAC | REAC:R-HSA-877300 | Interferon gamma signaling | 64 | 1.173120e-13 |
| REAC | REAC:R-HSA-1236974 | ER-Phagosome pathway | 76 | 3.518394e-11 |
| REAC | REAC:R-HSA-1236975 | Antigen processing-Cross presentation | 84 | 1.403507e-10 |
| REAC | REAC:R-HSA-381426 | Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs) | 78 | 2.634256e-09 |
| REAC | REAC:R-HSA-8957275 | Post-translational protein phosphorylation | 73 | 1.208023e-08 |
| REAC | REAC:R-HSA-5357801 | Programmed Cell Death | 188 | 1.760269e-08 |
| REAC | REAC:R-HSA-8852276 | The role of GTSE1 in G2/M progression after G2 checkpoint | 67 | 3.157421e-08 |
| REAC | REAC:R-HSA-109581 | Apoptosis | 161 | 8.001187e-08 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| WP | WP:WP5115 | Network map of SARS-CoV-2 signaling pathway | 136 | 1.258866e-12 |
| WP | WP:WP183 | Proteasome degradation | 58 | 2.042227e-12 |
| WP | WP:WP619 | Type II interferon signaling (IFNG) | 25 | 5.861221e-10 |
| WP | WP:WP4197 | Immune response to tuberculosis | 22 | 8.494371e-09 |
| WP | WP:WP4868 | Type I interferon induction and signaling during SARS-CoV-2 infection | 26 | 2.181967e-07 |
| WP | WP:WP5039 | SARS-CoV-2 innate immunity evasion and cell-specific immune response | 50 | 2.450495e-07 |
| WP | WP:WP2840 | Hair follicle development: cytodifferentiation - part 3 of 3 | 46 | 7.812426e-06 |
| WP | WP:WP4880 | Host-pathogen interaction of human coronaviruses - interferon induction | 30 | 1.301529e-05 |
| WP | WP:WP5083 | Neuroinflammation and glutamatergic signaling | 99 | 3.412772e-05 |
| WP | WP:WP4217 | Ebola virus pathway in host | 107 | 1.629663e-04 |
The top gene sets returned by this analysis, listed in Table 11, were almost exactly identical to the top gene sets returned by the ORA with significantly (FDR < 0.05) up-regulated genes (Table 7). The only difference is that the p-value associated with each returned gene set is a lot smaller (and more significant) in this analysis than the p-value from the previous (FDR < 0.05) analysis. This is probably because the query gene list is smaller, and so the gene sets are even more over-represented.
oras <- list(upRegORA$result, veryUpRegORA$result)
compareNumGeneSets(oras, fdrs=c("FDR < 0.05", "FDR < 0.001"),
logFCs=c("+", "+"),
caption = paste("Number of significantly over-represented",
"gene sets in up-regulated gene lists with",
"different thresholds."))
| DEG Sig. Threshold | Expression Change | Num. Gene Sets |
|---|---|---|
| FDR < 0.05 | Up-regulated | 1488 |
| FDR < 0.001 | Up-regulated | 1570 |
Table 12 shows that the very significantly (FDR < 0.001) up-regulated gene list returned 82 more significantly expressed gene sets than the up-regulated gene list with FDR < 0.05.
oraFile <- paste0(dataDir, "/downreg_ora_001.rds")
if (!file.exists(oraFile)) {
veryDownRegORA <- getGeneSets(veryDownRegGenes, annotationData, bgGenes)
saveRDS(veryDownRegORA, file=oraFile)
} else {
veryDownRegORA <- readRDS(oraFile)
}
allOras[[5]] <- list(result = veryDownRegORA$result,
sig = "FDR < 0.001", expr = "Down-regulated")
oraPlot <-
gprofiler2::gostplot(veryDownRegORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
'in Down-Regulated DEG List',
'(Threshold: FDR < 0.001)'),
xaxis = list(title = 'Annotation Dataset'))
Figure 9: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on all very significantly (FDR < 0.001) down-regulated genes in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).
showTopSmallGeneSets(allOras[[5]], annotationData)
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| GO:BP | GO:0001666 | response to hypoxia | 135 | 4.774269e-11 |
| GO:BP | GO:0036293 | response to decreased oxygen levels | 138 | 9.364504e-11 |
| GO:BP | GO:0070482 | response to oxygen levels | 151 | 4.529153e-10 |
| GO:BP | GO:0002181 | cytoplasmic translation | 130 | 6.159332e-08 |
| GO:BP | GO:0071456 | cellular response to hypoxia | 85 | 1.885910e-07 |
| GO:BP | GO:0098742 | cell-cell adhesion via plasma-membrane adhesion molecules | 99 | 2.976675e-07 |
| GO:BP | GO:0036294 | cellular response to decreased oxygen levels | 88 | 3.688665e-07 |
| GO:BP | GO:0071453 | cellular response to oxygen levels | 100 | 1.182984e-06 |
| GO:BP | GO:0035725 | sodium ion transmembrane transport | 71 | 2.953592e-05 |
| GO:BP | GO:0045229 | external encapsulating structure organization | 140 | 8.088140e-05 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| KEGG | KEGG:03010 | Ribosome | 127 | 2.047333e-06 |
| KEGG | KEGG:05171 | Coronavirus disease - COVID-19 | 170 | 8.714660e-06 |
| KEGG | KEGG:04066 | HIF-1 signaling pathway | 83 | 2.960316e-04 |
| KEGG | KEGG:00010 | Glycolysis / Gluconeogenesis | 43 | 3.638603e-03 |
| KEGG | KEGG:04512 | ECM-receptor interaction | 63 | 3.638603e-03 |
| KEGG | KEGG:04974 | Protein digestion and absorption | 46 | 6.363647e-03 |
| KEGG | KEGG:00500 | Starch and sucrose metabolism | 20 | 1.972024e-02 |
| KEGG | KEGG:04020 | Calcium signaling pathway | 119 | 2.407286e-02 |
| KEGG | KEGG:01250 | Biosynthesis of nucleotide sugars | 34 | 2.407286e-02 |
| KEGG | KEGG:04972 | Pancreatic secretion | 47 | 2.407286e-02 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| REAC | REAC:R-HSA-156827 | L13a-mediated translational silencing of Ceruloplasmin expression | 107 | 1.024888e-12 |
| REAC | REAC:R-HSA-156842 | Eukaryotic Translation Elongation | 89 | 1.024888e-12 |
| REAC | REAC:R-HSA-72689 | Formation of a pool of free 40S subunits | 97 | 1.024888e-12 |
| REAC | REAC:R-HSA-72737 | Cap-dependent Translation Initiation | 115 | 1.094750e-12 |
| REAC | REAC:R-HSA-72613 | Eukaryotic Translation Initiation | 115 | 1.094750e-12 |
| REAC | REAC:R-HSA-156902 | Peptide chain elongation | 85 | 1.711348e-12 |
| REAC | REAC:R-HSA-192823 | Viral mRNA Translation | 85 | 1.711348e-12 |
| REAC | REAC:R-HSA-72706 | GTP hydrolysis and joining of the 60S ribosomal subunit | 108 | 2.660910e-12 |
| REAC | REAC:R-HSA-975956 | Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC) | 91 | 1.247020e-11 |
| REAC | REAC:R-HSA-9633012 | Response of EIF2AK4 (GCN2) to amino acid deficiency | 97 | 1.366156e-11 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| WP | WP:WP477 | Cytoplasmic ribosomal proteins | 85 | 2.402055e-11 |
| WP | WP:WP4018 | Clear cell renal cell carcinoma pathways | 72 | 6.477499e-08 |
| WP | WP:WP4629 | Aerobic glycolysis | 11 | 2.137217e-05 |
| WP | WP:WP3614 | Photodynamic therapy-induced HIF-1 survival signaling | 32 | 2.987092e-05 |
| WP | WP:WP1946 | Cori cycle | 13 | 9.643910e-05 |
| WP | WP:WP534 | Glycolysis and gluconeogenesis | 33 | 2.157779e-04 |
| WP | WP:WP5097 | CCL18 signaling pathway | 40 | 1.631661e-03 |
| WP | WP:WP4786 | Type I collagen synthesis in the context of osteogenesis imperfecta | 28 | 7.995232e-03 |
| WP | WP:WP5055 | Burn wound healing | 73 | 1.463845e-02 |
| WP | WP:WP4707 | Aspirin and miRNAs | 10 | 1.612661e-02 |
compareNumGeneSets(list(downRegORA$result, veryDownRegORA$result),
fdrs=c("FDR < 0.05", "FDR < 0.001"),
logFCs=c("-", "-"),
caption = paste("Number of significantly over-represented",
"gene sets in down-regulated gene lists with",
"different thresholds."))
| DEG Sig. Threshold | Expression Change | Num. Gene Sets |
|---|---|---|
| FDR < 0.05 | Down-regulated | 449 |
| FDR < 0.001 | Down-regulated | 542 |
As with the up-regulated ORAs, the gene sets returned in this analysis (listed in Table 13) are almost exactly the same and in nearly the same order as gene sets from the ORA with significantly (FDR < 0.05) down-regulated genes (Table 8). Surprisingly, the p-values were bigger (less significant) in the ORA produced by the smaller gene list (stricter threshold). This suggests that the less significant DEGs were important in the down-regulated gene set enrichment analysis, since removing them resulted in less significant results.
The ORA from the stricter-thresholded down-regulated gene list (FDR < 0.001) resulted in 93 more significant gene sets than the weaker-thresholded gene list (FDR < 0.05) (Table 14).
# Load or fetch ORA results for all very sig DEGs
oraFile <- paste0(dataDir, "/alldegs_ora_001.rds")
if (!file.exists(oraFile)) {
allGenes <- c(veryUpRegGenes, veryDownRegGenes)
verySigORA <- getGeneSets(allGenes, annotationData, bgGenes)
saveRDS(verySigORA, file=oraFile)
} else {
verySigORA <- readRDS(oraFile)
}
# Save to list of ORAs
allOras[[6]] <- list(result = verySigORA$result,
sig = "FDR < 0.001", expr = "All")
oraPlot <- gprofiler2::gostplot(verySigORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
'in DEG List (Threshold: FDR < 0.001)'),
xaxis = list(title = 'Annotation Dataset'),
yaxis = list(title = '-log10(FDR) (capped at 10^-16)'))
Figure 10: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on all very significant (FDR < 0.001) DEGs in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).
showTopSmallGeneSets(allOras[[6]], annotationData)
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| GO:BP | GO:0051607 | defense response to virus | 155 | 2.768761e-15 |
| GO:BP | GO:0140546 | defense response to symbiont | 155 | 2.768761e-15 |
| GO:BP | GO:0002683 | negative regulation of immune system process | 181 | 4.123843e-12 |
| GO:BP | GO:0098742 | cell-cell adhesion via plasma-membrane adhesion molecules | 99 | 7.209449e-12 |
| GO:BP | GO:0050792 | regulation of viral process | 119 | 1.112942e-11 |
| GO:BP | GO:1903900 | regulation of viral life cycle | 102 | 2.987380e-11 |
| GO:BP | GO:0048525 | negative regulation of viral process | 69 | 3.196549e-11 |
| GO:BP | GO:0045071 | negative regulation of viral genome replication | 45 | 3.746031e-11 |
| GO:BP | GO:0007162 | negative regulation of cell adhesion | 159 | 3.130204e-10 |
| GO:BP | GO:0019058 | viral life cycle | 196 | 3.666078e-10 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| KEGG | KEGG:05171 | Coronavirus disease - COVID-19 | 170 | 2.704195e-10 |
| KEGG | KEGG:04512 | ECM-receptor interaction | 63 | 4.614595e-08 |
| KEGG | KEGG:05169 | Epstein-Barr virus infection | 159 | 2.597519e-06 |
| KEGG | KEGG:05202 | Transcriptional misregulation in cancer | 126 | 5.516569e-06 |
| KEGG | KEGG:04514 | Cell adhesion molecules | 79 | 5.516569e-06 |
| KEGG | KEGG:05164 | Influenza A | 122 | 5.516569e-06 |
| KEGG | KEGG:04510 | Focal adhesion | 163 | 7.978976e-06 |
| KEGG | KEGG:04974 | Protein digestion and absorption | 46 | 1.865757e-05 |
| KEGG | KEGG:05418 | Fluid shear stress and atherosclerosis | 111 | 2.724035e-05 |
| KEGG | KEGG:04020 | Calcium signaling pathway | 119 | 2.775534e-05 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| REAC | REAC:R-HSA-1474244 | Extracellular matrix organization | 196 | 7.676585e-20 |
| REAC | REAC:R-HSA-913531 | Interferon Signaling | 154 | 6.069943e-15 |
| REAC | REAC:R-HSA-909733 | Interferon alpha/beta signaling | 53 | 2.470923e-13 |
| REAC | REAC:R-HSA-381426 | Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs) | 78 | 1.659044e-11 |
| REAC | REAC:R-HSA-8957275 | Post-translational protein phosphorylation | 73 | 1.220846e-10 |
| REAC | REAC:R-HSA-1474290 | Collagen formation | 64 | 2.862371e-09 |
| REAC | REAC:R-HSA-877300 | Interferon gamma signaling | 64 | 2.862371e-09 |
| REAC | REAC:R-HSA-2022090 | Assembly of collagen fibrils and other multimeric structures | 44 | 1.350811e-08 |
| REAC | REAC:R-HSA-1236974 | ER-Phagosome pathway | 76 | 2.316021e-07 |
| REAC | REAC:R-HSA-6785807 | Interleukin-4 and Interleukin-13 signaling | 70 | 2.684536e-07 |
| Source | Term ID | Term Name | Term Size | P-Value |
|---|---|---|---|---|
| WP | WP:WP5115 | Network map of SARS-CoV-2 signaling pathway | 136 | 1.695931e-09 |
| WP | WP:WP5055 | Burn wound healing | 73 | 6.377130e-08 |
| WP | WP:WP183 | Proteasome degradation | 58 | 6.539560e-08 |
| WP | WP:WP619 | Type II interferon signaling (IFNG) | 25 | 2.597927e-07 |
| WP | WP:WP2911 | miRNA targets in ECM and membrane receptors | 18 | 2.704356e-06 |
| WP | WP:WP4018 | Clear cell renal cell carcinoma pathways | 72 | 7.120623e-06 |
| WP | WP:WP3614 | Photodynamic therapy-induced HIF-1 survival signaling | 32 | 8.286784e-06 |
| WP | WP:WP4197 | Immune response to tuberculosis | 22 | 1.187142e-05 |
| WP | WP:WP306 | Focal adhesion | 160 | 1.214604e-05 |
| WP | WP:WP4868 | Type I interferon induction and signaling during SARS-CoV-2 infection | 26 | 3.269936e-05 |
Unlike the previous two ORAs, this analysis returned remarkably different gene sets than its less significant counterpart (all DEGs, FDR < 0.05). The top gene sets in this cumulative ORA (listed in Table 15) had primarily up-regulated gene sets, whereas the previous cumulative ORA had mostly down-regulated gene sets (Table 9). In particular, I noticed that gene sets related to ribosomes, translation, and elongation did not make the top-ten list in this analysis, and were instead replaced with gene sets related to viral response, cell cycle, and interferon signaling. This reflects the results from above, where the up-regulated ORA with FDR < 0.001 DEGs had more significant p-values but the down-regulated ORA with FDR < 0.001 DEGs had less significant p-values. Thus, it seems that using a more restricted gene list (with a more stringent threshold) causes up-regulated pathways to be more significantly over-represented than down-regulated pathways. This suggests that the most significant DEGs are up-regulated rather than down-regulated. The volcano plot earlier in this report confirms this pattern: the up-regulated DEGs had larger logFCs than the down-regulated DEGs, and the top part of the plot was more densely populated with red dots (up-regulated genes) than blue dots (down-regulated genes).
oras <- list(sigORA$result, verySigORA$result)
compareNumGeneSets(oras, fdrs=c("FDR < 0.05", "FDR < 0.001"),
logFCs=c("0", "0"),
caption = paste("Number of significantly over-represented",
"gene sets in DEG lists with two",
"different thresholds."))
| DEG Sig. Threshold | Expression Change | Num. Gene Sets |
|---|---|---|
| FDR < 0.05 | All DEGs | 1773 |
| FDR < 0.001 | All DEGs | 1882 |
Table 16 shows that the ORA from the stricter-thresholded DEG list (FDR < 0.001) returned 109 more significant gene sets than the weaker-thresholded DEG list (FDR < 0.05).
oras <- list(verySigORA$result, veryUpRegORA$result, veryDownRegORA$result)
compareNumGeneSets(oras, fdrs=rep("FDR < 0.001", 3), logFCs=c("0", "+", "-"),
caption = paste("Number of significantly over-represented",
"gene sets in gene lists with threshold",
"FDR < 0.001."))
| DEG Sig. Threshold | Expression Change | Num. Gene Sets |
|---|---|---|
| FDR < 0.001 | All DEGs | 1882 |
| FDR < 0.001 | Up-regulated | 1570 |
| FDR < 0.001 | Down-regulated | 542 |
Table 17 shows the number of gene sets returned in each ORA that used a DEG significance threshold of 0.001. As in the weaker-threshold (FDR < 0.05) case, the number of gene sets returned by the cumulative ORA (all DEGs) was greater than either of the individual ORAs (only up- or down-regulated DEGs), but less than the sum of the individual ORAs. Again, the down-regulated gene list returned much fewer gene sets than the up-regulated gene list.
Table 18 summarizes the results of the six ORAs performed above. Each cell shows the number of significantly over-represented gene sets found in each gene list, and each gene list is characterized by the FDR threshold (columns) and the expression change (rows).
numSigPathways <- unlist(lapply(
list(sigORA$result, upRegORA$result, downRegORA$result),
getNumSigPathways
))
numVerySigPathways <- unlist(lapply(
list(verySigORA$result, veryUpRegORA$result, veryDownRegORA$result),
getNumSigPathways
))
numGeneSetsDf <- data.frame("FDR < 0.05" = numSigPathways,
"FDR < 0.001" = numVerySigPathways, check.names=F)
rownames(numGeneSetsDf) <- c("All DEGs", "Up-regulated", "Down-regulated")
summaryKable <- knitr::kable(
numGeneSetsDf, type="html",
caption = paste("Number of significantly over-represented gene sets returned",
"by each ORA"))
kableExtra::kable_styling(summaryKable, "condensed")
| FDR < 0.05 | FDR < 0.001 | |
|---|---|---|
| All DEGs | 1773 | 1882 |
| Up-regulated | 1488 | 1570 |
| Down-regulated | 449 | 542 |
In all cases, the stricter-threshold ORA returned more gene sets than its weaker-threshold counterpart.
The original study claimed that (1) UPR is generally up-regulated in MKD cells relative to normal kidney cells, and the ATF6 branch of UPR in particular is up-regulated; and (2) misfolded MUC1 proteins are trapped inside TMED9-containing vesicles in the early secretory pathway (between the ER and the Golgi), preventing misfolded Muc1 from being degraded in the lysosomes (Dvela-Levitt et al., 2019). I will investigate both of these claims, using the two helper functions below.
queryPathways <- function(queryTerm, oraList, significant=TRUE,
caseSensitive=FALSE) {
# initialize pathways data.frame
pathways <- as.data.frame(matrix(nrow=0, ncol=7))
colnames(pathways) <- c("sig", "expr","source", "term_id", "term_name",
"term_size", "p_value")
for (ora in oraList) {
# If significant option is set, remove any non-sig gene sets
if (significant) {
geneSets <- ora$result[ora$result$significant, ]
} else {
geneSets <- ora$result
}
# Get all pathways with the query term from ORA results
if (any(grepl(queryTerm, geneSets$term_name, ignore.case=T))) {
df <- geneSets[grepl(queryTerm, geneSets$term_name, ignore.case=T), ]
df$sig <- ora$sig
df$expr <- paste(ora$expr, "DEGs")
pathways <- rbind(pathways, df[, colnames(pathways)])
}
}
# Sort pathways by pathways ID, DEG expression change, and DEG threshold
pathways$expr <- factor(pathways$expr)
pathways <- pathways[with(pathways,
order(term_id, -expr, sig)), ]
# Remove duplicate pathways, with preference to up/down regulated or
# stricter threshold (FDR < 0.001)
pathways <- pathways[!duplicated(pathways$term_id), ]
# Print table
cap <- paste0("Pathways containing the term '", queryTerm, "'")
if (significant) {
cap <- paste(cap,
"that are significantly over-represented in any of the ORAs.")
}
pathwaysKable <- knitr::kable(
pathways,
col.names = c("DEG Sig. Threshold", "DEG Expr. Change",
"Annotation Dataset", "Gene Set ID","Gene Set Name",
"Gene Set Size", "P-Value"),
caption=cap, row.names = FALSE, type="html", digits=42)
kableExtra::kable_styling(pathwaysKable, "condensed")
}
First, I will search all the ORA results for significantly regulated pathways that have the terms “UPR,” “ATF6,” “IRE1,” or “PERK” in their name (ATF6, IRE1, and PERK are the three branches of UPR).
queryPathways("UPR", allOras, significant = T, caseSensitive = T)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0097435 | supramolecular fiber organization | 496 | 0.0001013347 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:1902903 | regulation of supramolecular fiber organization | 246 | 0.0123923493 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:1902904 | negative regulation of supramolecular fiber organization | 106 | 0.0297628916 |
| FDR < 0.05 | All DEGs DEGs | GO:BP | GO:1902905 | positive regulation of supramolecular fiber organization | 117 | 0.0445919054 |
queryPathways("ATF6", allOras, significant = T)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
queryPathways("IRE1", allOras, significant = T)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
queryPathways("PERK", allOras, significant = T)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
Tables 19 - 22 reveal that there are no significantly enriched pathways containing the terms “UPR,” “ATF6,” “IRE1,” or “PERK” in their name, respectively. As a sanity check, I can turn off the significance requirement when querying pathways.
queryPathways("UPR", allOras, significant = F, caseSensitive = T)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0097435 | supramolecular fiber organization | 496 | 0.0001013347 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:1902903 | regulation of supramolecular fiber organization | 246 | 0.0123923493 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:1902904 | negative regulation of supramolecular fiber organization | 106 | 0.0297628916 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:1902905 | positive regulation of supramolecular fiber organization | 117 | 0.3463039114 |
| FDR < 0.001 | Up-regulated DEGs | REAC | REAC:R-HSA-381119 | Unfolded Protein Response (UPR) | 86 | 0.5312205229 |
| FDR < 0.001 | Up-regulated DEGs | WP | WP:WP4479 | Supression of HMGB1 mediated inflammation by THBD | 7 | 0.1073607927 |
queryPathways("ATF6", allOras, significant = F)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
| FDR < 0.001 | Down-regulated DEGs | GO:BP | GO:0036500 | ATF6-mediated unfolded protein response | 7 | 0.5948911 |
| FDR < 0.001 | Down-regulated DEGs | REAC | REAC:R-HSA-381033 | ATF6 (ATF6-alpha) activates chaperones | 11 | 0.9436722 |
| FDR < 0.001 | Down-regulated DEGs | REAC | REAC:R-HSA-381183 | ATF6 (ATF6-alpha) activates chaperone genes | 9 | 0.9112982 |
Tables 23 - 24 prove that there are pathways associated with general UPR and ATF6-mediated UPR, but that none of those pathways are significantly regulated. They all have extremely large p-values (> 0.5), and therefore contradict the results from the original study.
Next, I will look for pathways related to vesicle transport between the ER and the Golgi, specifically TMED9-containing vesicles.
queryPathways("TMED9", allOras, significant = TRUE)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
queryPathways("vesicle", allOras, significant = TRUE)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
| FDR < 0.001 | Down-regulated DEGs | GO:BP | GO:0016192 | vesicle-mediated transport | 966 | 0.01848052 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0060627 | regulation of vesicle-mediated transport | 302 | 0.01978143 |
| FDR < 0.001 | Up-regulated DEGs | WP | WP:WP4300 | Extracellular vesicles in the crosstalk of cardiac cells | 14 | 0.01477859 |
queryPathways("Golgi", allOras, significant = TRUE)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
| FDR < 0.05 | All DEGs DEGs | REAC | REAC:R-HSA-6811436 | COPI-independent Golgi-to-ER retrograde traffic | 42 | 0.02258587 |
| FDR < 0.001 | Up-regulated DEGs | REAC | REAC:R-HSA-975576 | N-glycan antennae elongation in the medial/trans-Golgi | 19 | 0.01299464 |
The search for TMED9 returned no results (Table 25); however this is expected since TMED9 is simply a protein-binding cargo receptor which likely does not have any pathways named after it.
The other two queries did return results. Table 26 shows three pathways from GO:BP and WikiPathways that are significantly enriched in at least one of the thresholded gene lists, and Table 27 shows two pathways from Reactome that are significantly enriched. The results suggest that vesicle-mediated transport is down-regulated and that there is differential regulation of COPI-independent Golgi-to-ER retrograde traffic (although it is unclear whether that pathway is up- or down-regulated). The original study shows that anterograde trafficking from the Golgi back to the ER is necessary for the lysosomal degradation and clearing of misfolded Muc1, and claims that interference in this early secretory pathways causes the progression of MKD (Dvela-Levitt et al., 2019). Thus, my ORA results align well with the study’s results: down-regulation of vesicle-mediated transport and potential down-regulation of retrograde transport would both greatly hinder the secretory pathway, thus preventing accumulated misfolded Muc1 from being transported to the lysosome for degradation.
Unfortunately, there are very few publications about MKD. This is partly because MKD is difficult to diagnose and differentiate from other genetic kidney disease; it is caused by a subtle mutation that is hard to identify and it has no symptoms except for tubular fibrosis in the kidneys (Blumenstiel et al., 2016; Eckardt et al., 2015). All of these factors make MKD challenging to study, so the field of evidence related to this disease is extremely sparse.
However, a review article by Sho Hasegawa and Reiko Inagi discusses the role of UPR as well as many other pathways involved in kidney disease (Hasegawa & Inagi, 2020). Specifically, Hasegawa and Inagi explain the causes and responses of ER stress in kidney cells (Hasegawa & Inagi, 2020). A toxic proteinopathy like MKD is a form of ER stress (Dvela-Levitt et al., 2019); therefore, pathways related to ER stress should be enriched in MKD cells. This turns out to be the case for this RNASeq dataset.
First, Hasegawa and Inagi state that oxidative stress and hypoxia can cause pathogenic ER stress in the kidneys (Hasegawa & Inagi, 2020). Table 28 reveals that eight hypoxia-related gene sets are significantly enriched in one of the thresholded gene lists. Based on Table 28 alone, it is unclear whether cellular response to hypoxia is up-regulated or down-regulated in MKD cells: the GO results suggest down-regulation while the Reactome results suggest up-regulation.
Second, Hasegawa and Inagi claim that ER stress in tubular cells can cause tubular apoptosis (Hasegawa & Inagi, 2020). Apoptosis was one of the most-significantly up-regulated pathways in our ORA tables (Tables 7, 11). Similarly, Table 29 shows that apoptosis-related pathways in three different annotation datasets (KEGG, Reactome, WikiPathways) are highly up-regulated in MKD cells.
Finally, Hasegawa and Inagi discuss how during the adaptive phase of ER stress, crosstalk between mitochondria and the ER result in increased calcium uptake (Hasegawa & Inagi, 2020). Table 30 shows that many calcium ion transport pathways from GO:BP and KEGG are significantly up-regulated in MKD cells.
Combining all of these observations, it is clear that many of the pathways up-regulated in MKD cells are related to ER stress. Although these relationships are indirect (e.g., the pathways reveal symptoms of ER stress), they are abundant and significant. One would expect MKD cells to be strongly enriched in UPR, but also enriched in pathways related to ER stress. My ORA results do not support the former of those hypotheses, but they do support the latter. Therefore, the results of my over-representation analysis are only partly supported by existing publications.
queryPathways("hypoxia", allOras, significant = TRUE)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
| FDR < 0.001 | Down-regulated DEGs | GO:BP | GO:0001666 | response to hypoxia | 135 | 4.774269e-11 |
| FDR < 0.001 | All DEGs | GO:BP | GO:0061419 | positive regulation of transcription from RNA polymerase II promoter in response to hypoxia | 5 | 3.172140e-02 |
| FDR < 0.001 | Down-regulated DEGs | GO:BP | GO:0071456 | cellular response to hypoxia | 85 | 1.885910e-07 |
| FDR < 0.001 | All DEGs | GO:BP | GO:1900038 | negative regulation of cellular response to hypoxia | 7 | 2.216487e-02 |
| FDR < 0.001 | All DEGs | GO:BP | GO:1990144 | intrinsic apoptotic signaling pathway in response to hypoxia | 5 | 3.172140e-02 |
| FDR < 0.001 | Down-regulated DEGs | REAC | REAC:R-HSA-1234158 | Regulation of gene expression by Hypoxia-inducible Factor | 9 | 6.885987e-03 |
| FDR < 0.001 | Up-regulated DEGs | REAC | REAC:R-HSA-1234174 | Cellular response to hypoxia | 70 | 1.168336e-04 |
| FDR < 0.001 | Up-regulated DEGs | REAC | REAC:R-HSA-1234176 | Oxygen-dependent proline hydroxylation of Hypoxia-inducible Factor Alpha | 62 | 6.982746e-05 |
queryPathways("apoptosis", allOras, significant = TRUE)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
| FDR < 0.001 | Up-regulated DEGs | KEGG | KEGG:04210 | Apoptosis | 115 | 5.591527e-03 |
| FDR < 0.001 | Up-regulated DEGs | REAC | REAC:R-HSA-109581 | Apoptosis | 161 | 8.001187e-08 |
| FDR < 0.05 | Up-regulated DEGs | REAC | REAC:R-HSA-109606 | Intrinsic Pathway for Apoptosis | 53 | 3.315779e-02 |
| FDR < 0.001 | Up-regulated DEGs | REAC | REAC:R-HSA-169911 | Regulation of Apoptosis | 50 | 3.062822e-06 |
| FDR < 0.001 | Up-regulated DEGs | WP | WP:WP1772 | Apoptosis modulation and signaling | 82 | 2.129429e-02 |
| FDR < 0.001 | Up-regulated DEGs | WP | WP:WP254 | Apoptosis | 76 | 1.311449e-02 |
queryPathways("calcium", allOras, significant = TRUE)
| DEG Sig. Threshold | DEG Expr. Change | Annotation Dataset | Gene Set ID | Gene Set Name | Gene Set Size | P-Value |
|---|---|---|---|---|---|---|
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0006816 | calcium ion transport | 166 | 0.0297733275 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0006874 | cellular calcium ion homeostasis | 172 | 0.0005372135 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0007204 | positive regulation of cytosolic calcium ion concentration | 106 | 0.0033734356 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0010522 | regulation of calcium ion transport into cytosol | 46 | 0.0015475911 |
| FDR < 0.05 | Up-regulated DEGs | GO:BP | GO:0010880 | regulation of release of sequestered calcium ion into cytosol by sarcoplasmic reticulum | 15 | 0.0253346820 |
| FDR < 0.001 | All DEGs | GO:BP | GO:0014808 | release of sequestered calcium ion into cytosol by sarcoplasmic reticulum | 17 | 0.0374391277 |
| FDR < 0.001 | All DEGs | GO:BP | GO:0016339 | calcium-dependent cell-cell adhesion via plasma membrane cell adhesion molecules | 12 | 0.0168105886 |
| FDR < 0.001 | Down-regulated DEGs | GO:BP | GO:0017156 | calcium-ion regulated exocytosis | 30 | 0.0307734435 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0019722 | calcium-mediated signaling | 92 | 0.0300799502 |
| FDR < 0.05 | Up-regulated DEGs | GO:BP | GO:0035584 | calcium-mediated signaling using intracellular calcium source | 13 | 0.0395763912 |
| FDR < 0.001 | All DEGs | GO:BP | GO:0050849 | negative regulation of calcium-mediated signaling | 10 | 0.0268160792 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0051208 | sequestering of calcium ion | 58 | 0.0312016922 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0051209 | release of sequestered calcium ion into cytosol | 53 | 0.0158892093 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0051279 | regulation of release of sequestered calcium ion into cytosol | 39 | 0.0036915856 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0051281 | positive regulation of release of sequestered calcium ion into cytosol | 23 | 0.0413259942 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0051282 | regulation of sequestering of calcium ion | 55 | 0.0212373947 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0051283 | negative regulation of sequestering of calcium ion | 54 | 0.0186149059 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0051480 | regulation of cytosolic calcium ion concentration | 116 | 0.0001524332 |
| FDR < 0.001 | All DEGs | GO:BP | GO:0051481 | negative regulation of cytosolic calcium ion concentration | 3 | 0.0390109706 |
| FDR < 0.001 | All DEGs | GO:BP | GO:0051592 | response to calcium ion | 68 | 0.0046771321 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0051924 | regulation of calcium ion transport | 103 | 0.0053061602 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0051928 | positive regulation of calcium ion transport | 51 | 0.0119075304 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0055074 | calcium ion homeostasis | 175 | 0.0003510758 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0060314 | regulation of ryanodine-sensitive calcium-release channel activity | 18 | 0.0024371653 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0060401 | cytosolic calcium ion transport | 86 | 0.0332778773 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0060402 | calcium ion transport into cytosol | 70 | 0.0050229025 |
| FDR < 0.05 | All DEGs DEGs | GO:BP | GO:0070296 | sarcoplasmic reticulum calcium ion transport | 19 | 0.0166465155 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0070509 | calcium ion import | 35 | 0.0440922141 |
| FDR < 0.001 | All DEGs | GO:BP | GO:0070588 | calcium ion transmembrane transport | 125 | 0.0077475354 |
| FDR < 0.001 | All DEGs | GO:BP | GO:0071277 | cellular response to calcium ion | 43 | 0.0168355779 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0090279 | regulation of calcium ion import | 16 | 0.0267657284 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:0097553 | calcium ion transmembrane import into cytosol | 64 | 0.0051415712 |
| FDR < 0.001 | All DEGs | GO:BP | GO:0098703 | calcium ion import across plasma membrane | 10 | 0.0046871685 |
| FDR < 0.001 | All DEGs | GO:BP | GO:1902656 | calcium ion import into cytosol | 10 | 0.0046871685 |
| FDR < 0.001 | Up-regulated DEGs | GO:BP | GO:1903169 | regulation of calcium ion transmembrane transport | 71 | 0.0297733275 |
| FDR < 0.001 | All DEGs | GO:BP | GO:1903514 | release of sequestered calcium ion into cytosol by endoplasmic reticulum | 17 | 0.0374391277 |
| FDR < 0.001 | All DEGs | GO:BP | GO:1990927 | calcium ion regulated lysosome exocytosis | 3 | 0.0390109706 |
| FDR < 0.001 | Up-regulated DEGs | KEGG | KEGG:04020 | Calcium signaling pathway | 119 | 0.0084232981 |
| FDR < 0.05 | All DEGs DEGs | KEGG | KEGG:04961 | Endocrine and other factor-regulated calcium reabsorption | 36 | 0.0156409145 |
The common patterns I saw in the over-representation analysis of GSE129943 was an up-regulation of pathways related to viral response and hypoxia, and a down-regulation of pathways related to protein translation and vesicle-mediated transport. These results partly aligned with the original study (Dvela-Levitt et al., 2019) and were partly supported by other publications. A fuller gene set enrichment analysis is required to investigate the discrepancies between my ORA results and existing literature on MKD.